White Wines Quality Analysis by Andre Kenji Yai

Description

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

In the above reference, two datasets were created, using red and white wine samples. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (median of at least 3 evaluations made by wine experts). Each expert graded the wine quality between 0 (very bad) and 10 (very excellent). Several data mining methods were applied to model these datasets under a regression approach. The support vector machine model achieved the best results. Several metrics were computed: MAD, confusion matrix for a fixed error tolerance (T), etc. The authors also plot the relative importances of the input variables (as measured by a sensitivity analysis procedure).

Number of Attributes: 11 + output attribute

Note: several of the attributes may be correlated, thus it makes sense to apply some sort of feature selection.

Attribute information:

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):

  1. fixed acidity (tartaric acid - g / dm^3)
  2. volatile acidity (acetic acid - g / dm^3)
  3. citric acid (g / dm^3)
  4. residual sugar (g / dm^3)
  5. chlorides (sodium chloride - g / dm^3)
  6. free sulfur dioxide (mg / dm^3)
  7. total sulfur dioxide (mg / dm^3)
  8. density (g / cm^3)
  9. pH
  10. sulphates (potassium sulphate - g / dm^3)
  11. alcohol (% by volume)

Output variable (based on sensory data):

  12. quality (score between 0 and 10)

The Problem

We will analyze the dataset to investigate the features that make a good wine. In this problem we will use the White Wine dataset.

Univariate Plots Section

In this section, I will perform some preliminary exploration of the dataset.

Let’s start our analysis by summarizing the data and getting to know more about the dataset.

## [1] 4898   13

The dataset has 4898 rows and 13 columns. The columns are:

names(WhiteWines)
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Let’s see the first rows of the dataset

head(WhiteWines,n=10)
##     X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1   1           7.0             0.27        0.36           20.7     0.045
## 2   2           6.3             0.30        0.34            1.6     0.049
## 3   3           8.1             0.28        0.40            6.9     0.050
## 4   4           7.2             0.23        0.32            8.5     0.058
## 5   5           7.2             0.23        0.32            8.5     0.058
## 6   6           8.1             0.28        0.40            6.9     0.050
## 7   7           6.2             0.32        0.16            7.0     0.045
## 8   8           7.0             0.27        0.36           20.7     0.045
## 9   9           6.3             0.30        0.34            1.6     0.049
## 10 10           8.1             0.22        0.43            1.5     0.044
##    free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                   45                  170  1.0010 3.00      0.45     8.8
## 2                   14                  132  0.9940 3.30      0.49     9.5
## 3                   30                   97  0.9951 3.26      0.44    10.1
## 4                   47                  186  0.9956 3.19      0.40     9.9
## 5                   47                  186  0.9956 3.19      0.40     9.9
## 6                   30                   97  0.9951 3.26      0.44    10.1
## 7                   30                  136  0.9949 3.18      0.47     9.6
## 8                   45                  170  1.0010 3.00      0.45     8.8
## 9                   14                  132  0.9940 3.30      0.49     9.5
## 10                  28                  129  0.9938 3.22      0.45    11.0
##    quality
## 1        6
## 2        6
## 3        6
## 4        6
## 5        6
## 6        6
## 7        6
## 8        6
## 9        6
## 10       6

All features are numerical, most of them doubles, and X appears to be the row id.

Let’s summarize the data to learn more about the mean and percentiles of each feature.

summary(WhiteWines)
##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

There are two categories of variables in this database:

We also noticed that the variables with the highest variance are alcohol, free sulfur dioxide, total sulfur dioxide and residual sugar.
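
As a rough illustration of how such a variance comparison can be made, the sketch below uses only the ten rows printed by head() above, so the numbers are not those of the full dataset. The name `wines_head` is hypothetical and exists only for this example. It also prints the coefficient of variation, since raw variance depends on the units of each feature.

```r
# First ten rows of a few columns, copied from the head() output above.
# `wines_head` is a hypothetical name used only for this sketch.
wines_head <- data.frame(
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19, 3.26, 3.18, 3.00, 3.30, 3.22),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0)
)

# Variance of every column, largest first.
variances <- sort(sapply(wines_head, var), decreasing = TRUE)
print(round(variances, 3))

# Variance depends on units, so the coefficient of variation (sd / mean)
# is a fairer way to compare spread across features.
cv <- sort(sapply(wines_head, function(x) sd(x) / mean(x)), decreasing = TRUE)
print(round(cv, 3))
```

On the full data the same two calls could be applied to the numeric columns of WhiteWines.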

We will start by analysing quality and then the variables with high variance.

Quality

Let’s see how the wines were ranked.

WineQuality <- table(WhiteWines$quality)
# Quality scores are on the x axis; bar heights are counts of wines.
barplot(WineQuality, main="Wine Quality Distribution", xlab="Quality", ylab="Number of Wines")

The quality distribution is roughly bell-shaped: most wines were scored 6, followed by 5, the lowest score given was 3 and the highest 9.

I wonder what contributes to those grades. Let’s now look at the high-variance variables and how they are distributed.

As quality takes a discrete, specific set of values, I will convert it to a factor. This helps the visualization by letting us draw boxplots per quality level and investigate relationships between features.

WhiteWines$QualityCategory <- as.factor(WhiteWines$quality)
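
A small sketch of what the conversion does, using made-up scores rather than the real column. The `ordered = TRUE` argument and the explicit `levels = 3:9` are assumptions of this example, not something the analysis above uses. They keep the levels in score order even if a score is missing from the data.

```r
# Hypothetical quality scores, for illustration only.
quality <- c(6, 5, 7, 6, 3, 9, 8, 4, 6, 5)

# An ordered factor keeps the 3..9 scale and allows comparisons such as <.
quality_cat <- factor(quality, levels = 3:9, ordered = TRUE)
print(table(quality_cat))

# Pitfall: as.integer() on a factor returns level codes, not the scores.
codes  <- as.integer(quality_cat)                # 6 is the 4th level, so code 4
scores <- as.integer(as.character(quality_cat))  # recovers the original scores
print(rbind(codes, scores))
```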

Alcohol

The percentage of alcohol in the wine.

qplot(x = alcohol, data = WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$alcohol)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

We got a right-skewed histogram: most wines sit around 10%, the lowest at 8.00% and the highest at 14.20%. I wonder whether wines above 10% tend to get a higher quality ranking.
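
The `stat_bin()` message above suggests choosing a bin width explicitly. As a sketch, the base R code below bins the ten alcohol values printed by head() into 0.5% wide bins. The width 0.5 is an arbitrary choice for illustration. With ggplot2 the analogous step would be to pass `binwidth = 0.5` to `qplot()`.

```r
# Alcohol values of the first ten rows, copied from the head() output above.
alcohol <- c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11.0)

# Explicit, evenly spaced breaks covering the range of the full dataset
# (8.0 to 14.2 according to summary()).
breaks <- seq(8, 14.5, by = 0.5)

# plot = FALSE returns the bin counts without drawing anything.
h <- hist(alcohol, breaks = breaks, plot = FALSE)
print(data.frame(from = head(h$breaks, -1), to = h$breaks[-1], count = h$counts))
```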

Residual Sugar

The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet.

qplot(x = residual.sugar, data = WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800
sum(WhiteWines$residual.sugar < 1)
## [1] 77
sum(WhiteWines$residual.sugar > 45)
## [1] 1

The distribution is right-skewed, with most of the data below 20 g/L. The minimum is 0.60 g/L, the maximum is 65.80 g/L and the second highest value is 31.60 g/L.

Checking the thresholds above, 77 wines have less than 1 g/L of residual sugar and only 1 wine has more than 45 g/L.

WhiteWines$log10.residual.sugar <- log10(WhiteWines$residual.sugar)
qplot(x = log10.residual.sugar, data = WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$log10.residual.sugar)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.2218  0.2304  0.7160  0.6432  0.9956  1.8180

I applied a log10 transformation to the long-tailed data to better understand the distribution of residual sugar. On the log10 scale, residual sugar appears bimodal, with peaks near 0.25 and 0.8.
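
To relate the log10 scale back to grams per litre, the short check below transforms the minimum and maximum reported by summary() and converts the two peak positions back to the original units. It uses only numbers quoted in this section.

```r
# Minimum and maximum residual sugar from summary(), in g/L.
sugar_range <- c(min = 0.6, max = 65.8)

# log10 of the extremes should match the summary of the transformed column.
log_range <- log10(sugar_range)
print(round(log_range, 4))

# Peak positions read off the log10 histogram, converted back to g/L.
peaks_log10 <- c(0.25, 0.8)
peaks_g_per_l <- 10^peaks_log10
print(round(peaks_g_per_l, 2))
```

So the two modes correspond to roughly 1.8 g/L and 6.3 g/L of residual sugar.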

Total Sulfur Dioxide

Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.

qplot(x = total.sulfur.dioxide, data = WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$total.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution is roughly normal with its peak near 130 ppm. Most of the data lies between 0 and 250 ppm, with a minimum of 9 ppm and a maximum of 440 ppm.

mean(WhiteWines$total.sulfur.dioxide > 50)
## [1] 0.9899959

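About 99% of the wines have a total SO2 above 50 ppm. Note that the 50 ppm threshold quoted above applies to free SO2, not total SO2, so this share alone does not mean that SO2 is perceptible in these wines.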
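
The difference between the two thresholds can be sketched with the ten rows printed by head() above. These rows are not representative of the full dataset. The variable names are hypothetical and used only here. The example relies on the fact that the mean of a logical vector is the share of TRUE values.

```r
# Free and total SO2 of the first ten rows, copied from the head() output.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# mean() of a logical vector gives the proportion of TRUE values.
share_total_above_50 <- mean(total_so2 > 50)
share_free_above_50  <- mean(free_so2 > 50)

print(c(total = share_total_above_50, free = share_free_above_50))
```

In these ten rows every wine exceeds 50 ppm of total SO2 while none exceeds 50 ppm of free SO2, which is why the choice of column matters.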

Free Sulfur Dioxide

The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

qplot(x = free.sulfur.dioxide, data = WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$free.sulfur.dioxide)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

This is also roughly normal with a long right tail: most values lie between 0 and 100, with a peak near 50, a minimum of 2 and a maximum of 289.

Let’s look at the other variables as well. I suppose that pH and density may also be correlated.

pH

Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

qplot(x=pH, data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$pH)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

All wines are acidic (pH < 7). The distribution is roughly normal, centered near 3.1, with a minimum of 2.72 and a maximum of 3.82.

Density

The density of wine is close to that of water, depending on the percent alcohol and sugar content.

qplot(x=density,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$density)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

The distribution is roughly normal with its peak near 0.9937, a minimum of 0.987 and a maximum near 1.04. Most values lie between 0.99 and 1.00, so the density of most wines is very close to that of water, about 1 g/cm^3.

Chlorides

The amount of salt in the wine. I wonder how this affects quality.

qplot(x=chlorides,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$chlorides)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides are also roughly normal with a peak at 0.043. Most values lie between 0.009 and 0.09, with a minimum of 0.009 and a maximum of 0.346. I suspect that wines below 0.04 have the best quality.

Sulphates

A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.

qplot(x=sulphates,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$sulphates)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

The distribution looks bimodal, with peaks near 0.4 and 0.5. The minimum is 0.22 and the maximum is 1.08.

Acids

Let’s look at the acids. There are three types:

  1. fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
  2. volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
  3. citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

qplot(x=fixed.acidity,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$fixed.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200
qplot(x=volatile.acidity,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$volatile.acidity)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000
qplot(x=citric.acid,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$citric.acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Comparison:

All three are roughly normal. Fixed acidity is much larger than the other two: it peaks near 7 g/dm^3, while volatile acidity and citric acid peak near 0.3 g/dm^3, and it also has a higher variance.

Let’s aggregate the acids and see what we get.

WhiteWines$acid <- WhiteWines$citric.acid + WhiteWines$fixed.acidity + WhiteWines$volatile.acidity
qplot(x=acid,data=WhiteWines)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(WhiteWines$acid)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.130   6.890   7.405   7.467   7.960  14.960

The new feature closely follows fixed acidity, with its peak near 7.5.

Univariate Analysis

What is the structure of your dataset?

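The dataset has 4898 observations of 13 variables: the row id X, the 11 physicochemical inputs (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and the output quality.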

From our observations:

  • quality: Most wines have a quality of 6.
  • alcohol: Alcohol has a bimodal distribution with peaks near 9% and 11%.
  • density: Normal distribution concentrated between 0.99 and 1.00.
  • residual sugar: Right-skewed distribution; after a log10 transformation it is bimodal with peaks at 0.25 and 0.9.
  • pH: All wines are acidic; the distribution is normal with a peak at 3.2.
  • chlorides: Normal distribution concentrated between 0.02 and 0.09.

What is/are the main feature(s) of interest in your dataset?

The main features in the dataset are quality, alcohol and density. I’d like to determine which features best predict the quality of a wine. I suspect that a combination of density, alcohol and other features contributes to it.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Residual sugar, chlorides, pH, sulphates and others may contribute to the quality of a wine.

Did you create any new variables from existing variables in the dataset?

I created three new features: one factorizing the quality of the wine, one transforming residual sugar with log10, and one combining the three acids (fixed, volatile and citric). I did so in order to better visualize the relationships between the features.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Residual sugar had a right-skewed distribution, so I tried the log10 transformation and got a bimodal distribution.

Bivariate Plots Section

We will now look at the correlation matrix to see which variables are correlated with quality and with each other.

correlation_matrix <- cor(WhiteWines[,c(1:13,15,16)])
round(correlation_matrix,2)
##                          X fixed.acidity volatile.acidity citric.acid
## X                     1.00         -0.26             0.00       -0.15
## fixed.acidity        -0.26          1.00            -0.02        0.29
## volatile.acidity      0.00         -0.02             1.00       -0.15
## citric.acid          -0.15          0.29            -0.15        1.00
## residual.sugar        0.01          0.09             0.06        0.09
## chlorides            -0.05          0.02             0.07        0.11
## free.sulfur.dioxide  -0.01         -0.05            -0.10        0.09
## total.sulfur.dioxide -0.16          0.09             0.09        0.12
## density              -0.19          0.27             0.03        0.15
## pH                   -0.12         -0.43            -0.03       -0.16
## sulphates             0.01         -0.02            -0.04        0.06
## alcohol               0.21         -0.12             0.07       -0.08
## quality               0.04         -0.11            -0.19       -0.01
## log10.residual.sugar  0.02          0.07             0.09        0.06
## acid                 -0.26          0.99             0.07        0.39
##                      residual.sugar chlorides free.sulfur.dioxide
## X                              0.01     -0.05               -0.01
## fixed.acidity                  0.09      0.02               -0.05
## volatile.acidity               0.06      0.07               -0.10
## citric.acid                    0.09      0.11                0.09
## residual.sugar                 1.00      0.09                0.30
## chlorides                      0.09      1.00                0.10
## free.sulfur.dioxide            0.30      0.10                1.00
## total.sulfur.dioxide           0.40      0.20                0.62
## density                        0.84      0.26                0.29
## pH                            -0.19     -0.09                0.00
## sulphates                     -0.03      0.02                0.06
## alcohol                       -0.45     -0.36               -0.25
## quality                       -0.10     -0.21                0.01
## log10.residual.sugar           0.93      0.07                0.31
## acid                           0.10      0.05               -0.05
##                      total.sulfur.dioxide density    pH sulphates alcohol
## X                                   -0.16   -0.19 -0.12      0.01    0.21
## fixed.acidity                        0.09    0.27 -0.43     -0.02   -0.12
## volatile.acidity                     0.09    0.03 -0.03     -0.04    0.07
## citric.acid                          0.12    0.15 -0.16      0.06   -0.08
## residual.sugar                       0.40    0.84 -0.19     -0.03   -0.45
## chlorides                            0.20    0.26 -0.09      0.02   -0.36
## free.sulfur.dioxide                  0.62    0.29  0.00      0.06   -0.25
## total.sulfur.dioxide                 1.00    0.53  0.00      0.13   -0.45
## density                              0.53    1.00 -0.09      0.07   -0.78
## pH                                   0.00   -0.09  1.00      0.16    0.12
## sulphates                            0.13    0.07  0.16      1.00   -0.02
## alcohol                             -0.45   -0.78  0.12     -0.02    1.00
## quality                             -0.17   -0.31  0.10      0.05    0.44
## log10.residual.sugar                 0.42    0.76 -0.18     -0.03   -0.39
## acid                                 0.11    0.28 -0.43     -0.01   -0.12
##                      quality log10.residual.sugar  acid
## X                       0.04                 0.02 -0.26
## fixed.acidity          -0.11                 0.07  0.99
## volatile.acidity       -0.19                 0.09  0.07
## citric.acid            -0.01                 0.06  0.39
## residual.sugar         -0.10                 0.93  0.10
## chlorides              -0.21                 0.07  0.05
## free.sulfur.dioxide     0.01                 0.31 -0.05
## total.sulfur.dioxide   -0.17                 0.42  0.11
## density                -0.31                 0.76  0.28
## pH                      0.10                -0.18 -0.43
## sulphates               0.05                -0.03 -0.01
## alcohol                 0.44                -0.39 -0.12
## quality                 1.00                -0.06 -0.13
## log10.residual.sugar   -0.06                 1.00  0.09
## acid                   -0.13                 0.09  1.00

Looking at the quality row, quality is most strongly correlated with alcohol (0.44, positive). The next strongest correlations are negative: density (-0.31) and chlorides (-0.21).

sort(correlation_matrix["quality", ])
##              density            chlorides     volatile.acidity 
##         -0.307123313         -0.209934411         -0.194722969 
## total.sulfur.dioxide                 acid        fixed.acidity 
##         -0.174737218         -0.131377207         -0.113662831 
##       residual.sugar log10.residual.sugar          citric.acid 
##         -0.097576829         -0.064631762         -0.009209091 
##  free.sulfur.dioxide                    X            sulphates 
##          0.008158067          0.035763247          0.053677877 
##                   pH              alcohol              quality 
##          0.099427246          0.435574715          1.000000000
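
Sorting by the raw value puts strong negative and strong positive correlations at opposite ends. As a sketch, ranking by absolute value and dropping the trivial self-correlation gives a single list of the most related features. The data below is simulated purely for illustration: the coefficients, the `noise` column and the name `sim` are assumptions of this example and do not come from the wine dataset.

```r
# Simulated data, for illustration only (not the wine dataset).
set.seed(42)
n <- 200
alcohol <- runif(n, min = 8, max = 14)
density <- 1.02 - 0.0025 * alcohol + rnorm(n, sd = 0.001)
noise   <- rnorm(n)                       # unrelated to quality by construction
quality <- 3 + 0.3 * alcohol + rnorm(n)
sim <- data.frame(alcohol, density, noise, quality)

# Correlations of every column with quality, selected by name.
cors <- cor(sim)["quality", ]

# Drop the self-correlation and rank by absolute strength.
cors <- cors[names(cors) != "quality"]
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```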

Let’s see a visualization of this matrix.

## 
## Attaching package: 'psych'
## The following objects are masked from 'package:scales':
## 
##     alpha, rescale
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Bivariate Analysis

Analyzing the correlation matrix, the most important relationships are:

Density and residual sugar

Their correlation is 0.84.

ggplot(aes(y = density,x = residual.sugar),data = WhiteWines) + geom_point(alpha=0.5, size=1,position='jitter') + geom_smooth(method = 'lm')

The relationship between residual sugar and density is close to linear: as residual sugar increases, density increases as well.
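
The line drawn by `geom_smooth(method = 'lm')` is an ordinary least squares fit. A minimal sketch of the same fit with base R, using only the ten rows printed by head() earlier, so the coefficients will differ from those of the full dataset. The name `first_rows` is hypothetical.

```r
# Residual sugar and density of the first ten rows, from the head() output.
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7.0, 20.7, 1.6, 1.5),
  density        = c(1.0010, 0.9940, 0.9951, 0.9956, 0.9956,
                     0.9951, 0.9949, 1.0010, 0.9940, 0.9938)
)

# Least squares line: density as a function of residual sugar.
fit <- lm(density ~ residual.sugar, data = first_rows)
slope <- coef(fit)[["residual.sugar"]]
r <- cor(first_rows$residual.sugar, first_rows$density)

print(coef(fit))
print(r)
```

A positive slope together with a correlation close to 1 is what "close to linear" means here.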

Density and Alcohol

ggplot(aes(y = density,x = alcohol),data = WhiteWines) + geom_point(alpha=0.5, size=1,position='jitter') + geom_smooth(method = 'lm') + scale_y_continuous(limits = c(0.98,1.04))

As alcohol increases, density decreases. The density axis is limited to the range 0.98 to 1.04, and alcohol ranges from about 8 to 14.

Total Sulfur Dioxide and Density

ggplot(aes(y = density, x = total.sulfur.dioxide),data = WhiteWines) + geom_point(alpha=0.5, size=1,position='jitter') + geom_smooth(method = 'lm') + scale_y_continuous(limits = c(0.98,1.04))

Density also tends to increase with total sulfur dioxide, although the relationship is weaker (correlation 0.53).

Quality Multivariate plots

In this part we will create a plot that involves the quality of wines.

Quality and Alcohol

Let’s take a closer look at the relationship of alcohol and density with quality.

ggplot(aes(y = alcohol,x = QualityCategory),data = WhiteWines) + geom_boxplot() + geom_smooth(method = 'lm', aes(group = 1))

tapply(WhiteWines$alcohol,WhiteWines$QualityCategory, summary)
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.55   10.45   10.34   11.00   12.60 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.40   10.10   10.15   10.75   13.50 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.000   9.200   9.500   9.809  10.300  13.600 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50    9.60   10.50   10.58   11.40   14.00 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.60   10.60   11.40   11.37   12.30   14.20 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.50   11.00   12.00   11.64   12.60   14.00 
## 
## $`9`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.40   12.40   12.50   12.18   12.70   12.90

Alcohol tends to increase with quality, at least from quality 5 upward, and wines with more than about 10.8% alcohol tend to have better quality. Now let’s take a look at density.
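
The trend is not monotonic across the whole scale. As a quick check, the sketch below takes the median alcohol per quality level from the summaries printed above and looks at the step from each level to the next.

```r
# Median alcohol per quality level, copied from the tapply() output above.
quality <- 3:9
median_alcohol <- c(10.45, 10.10, 9.50, 10.50, 11.40, 12.00, 12.50)
names(median_alcohol) <- quality

# Change in median alcohol between consecutive quality levels.
steps <- diff(median_alcohol)
print(steps)

# Quality level with the lowest median alcohol.
lowest <- quality[which.min(median_alcohol)]
print(lowest)
```

The median falls from quality 3 to 5 and rises at every step from 5 to 9, so the positive relationship holds for the bulk of the wines but not at the low end of the scale.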

Quality and density

Density is another feature that may be correlated with quality. Let’s analyse that correlation.

ggplot(aes(x = QualityCategory, y = density),data = WhiteWines) + geom_boxplot()

tapply(WhiteWines$density,WhiteWines$QualityCategory, summary)
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9911  0.9925  0.9944  0.9949  0.9969  1.0000 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9892  0.9926  0.9941  0.9943  0.9958  1.0000 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9872  0.9933  0.9953  0.9953  0.9972  1.0020 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9876  0.9917  0.9937  0.9940  0.9959  1.0390 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9906  0.9918  0.9925  0.9937  1.0000 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9903  0.9916  0.9922  0.9935  1.0010 
## 
## $`9`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9896  0.9898  0.9903  0.9915  0.9906  0.9970

Lower density tends to go with better quality: the median density falls toward 0.99 as quality increases.

Quality and Chlorides

According to the correlation matrix, chlorides are also correlated with quality.

ggplot(aes(x = QualityCategory, y = chlorides),data = WhiteWines) + geom_boxplot()

tapply(WhiteWines$chlorides,WhiteWines$QualityCategory, summary)
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.02200 0.03625 0.04100 0.05430 0.05400 0.24400 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0130  0.0380  0.0460  0.0501  0.0540  0.2900 
## 
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.04000 0.04700 0.05155 0.05300 0.34600 
## 
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01500 0.03600 0.04300 0.04522 0.04900 0.25500 
## 
## $`7`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03819 0.04400 0.13500 
## 
## $`8`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01400 0.03000 0.03600 0.03831 0.04400 0.12100 
## 
## $`9`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0180  0.0210  0.0310  0.0274  0.0320  0.0350

As with density, lower chlorides go with better quality: the median falls from 0.047 for quality 5 to 0.031 for quality 9.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Wines with higher alcohol, lower density and lower chlorides tend to have better quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Density increases with residual sugar and total sulfur dioxide, and decreases as alcohol increases.

What was the strongest relationship you found?

The strongest relationship is between density and residual sugar (correlation 0.84), followed by density and alcohol (-0.78).
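
Strongly correlated pairs like these are also what the dataset note on feature selection refers to. As a sketch, the code below rebuilds a four-variable excerpt of the correlation matrix printed earlier and lists the pairs whose absolute correlation exceeds a cutoff. The cutoff of 0.7 is an arbitrary choice for this example.

```r
# Excerpt of the correlation matrix printed above (rounded values).
vars <- c("residual.sugar", "total.sulfur.dioxide", "density", "alcohol")
r <- matrix(c(
   1.00,  0.40,  0.84, -0.45,
   0.40,  1.00,  0.53, -0.45,
   0.84,  0.53,  1.00, -0.78,
  -0.45, -0.45, -0.78,  1.00
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Look only above the diagonal so each pair is reported once.
idx <- which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest pair first.
strong <- strong[order(-abs(strong$r)), ]
print(strong)
```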

Multivariate Plots Section

## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 86 rows containing missing values (geom_point).

## Warning: Removed 86 rows containing missing values (geom_point).

In this plot you can see that higher-quality wines tend to have more alcohol and lower density.

Alcohol and Chlorides

As with density, wines with lower chlorides tend to have more alcohol and better quality.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

The features most strongly related to quality are alcohol (positively) and chlorides (negatively).

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.


Final Plots and Summary

Plot One

Category plot

Description One

The wines could be graded from 0 to 10; the lowest grade given was 3 and the highest 9. The distribution is roughly bell-shaped with its peak at 6.

Plot Two

Category and Alcohol plot

Description Two

There is a clear relationship between wine quality and alcohol: as the alcohol percentage increases, quality tends to increase. The relationship is not perfect, though. Take quality 5, for example, where mean alcohol (9.81%) is lower than for quality 4 (10.15%).

Plot Three

Category and Alcohol and Density plot

## Warning: Removed 85 rows containing missing values (geom_point).

Description Three

In this graph we compare two variables correlated with wine quality, density and alcohol. Higher-quality wines tend to have more alcohol and lower density.

Reflection

The features most strongly related to wine quality are alcohol, density and chlorides: quality is positively related to alcohol and negatively related to density and chlorides. One suggestion to improve the dataset would be to include wines from different parts of the world. This study only considered wines from Portugal, so adding other regions could remove some bias.